Abstract



We present the living application, a method to autonomously manage applications 
on the grid. During its execution on the grid, the living application makes choices 
on the resources to use in order to complete its tasks. These choices can be based on 
the internal state, or on autonomously acquired knowledge from external sensors. 
By giving limited user capabilities to a living application, the living application is 
able to port itself from one resource topology to another. The application performs 
these actions at run-time without depending on users or external workflow tools. 
We demonstrate this new concept in a special case of a living application: the living 
simulation. Today, many simulations require a wide range of numerical solvers and 
run most efficiently if specialized nodes are matched to the solvers. The idea of 
the living simulation is that it decides itself which grid machines to use based on 
the numerical solver currently in use. In this paper we apply the living simulation 
to modelling the collision between two galaxies in a test setup with two specialized 
computers. This simulation switches at run-time between a GPU-enabled computer in
the Netherlands and a GRAPE-enabled machine that resides in the United States, 
using an oct-tree N-body code whenever it runs in the Netherlands and a direct 
N-body solver in the United States. 



Key words: grid workflow, multi-scale, N-body simulation, living application, 
self-organizing system 



1 Introduction 



A grid application consists of a range of tasks, each of which may run most ef- 
ficiently using a different set of resources. Most of these applications, however, 
use a fixed resource topology even though certain tasks could benefit from 
using different resources. This can be due to the computational demands of 
these tasks or due to a change in resource availability over time. A wide range 
of work has been done on developing external management systems that allow 
applications to change grid resources during execution. This includes workflow
systems (Herrera et al., 2005; Ludäscher et al., 2006; Yu and Buyya, 2005) or
grid schedulers with migration capabilities (Frey et al., 2002; Allen et al.,
2001) that support resource switches that are either part of a predefined
workflow or requested by the user.



An application management system that autonomously switches at run-time has
been proposed by Nascimento et al. (2007), where a hierarchically distributed
application management system dynamically schedules and migrates a
bag-of-tasks style MPI application, using a static hierarchy of schedulers to
accomplish this.



A self-adaptive grid application that does not require external managers has
been presented in Wrzesińska et al. (2005). Although this application does not
use grid scheduling, it is able to autonomously migrate to different locations
and change its number of processes. This has been accomplished by allowing
all processes to share knowledge and cooperate in managing the application's
topology.



In this work, we propose the living grid application, in which the application
also decides where to run, and which is also able to migrate itself at
run-time to another computer when needed. The intelligent migration from one
computer to another can be realized over a long baseline, but does not need
to be designed this way (see Sec. 2). We then apply this method to a
multi-scale simulation and demonstrate its working on an intercontinental grid
of semi-dedicated computers by simulating the merging between two galaxies,
which provides a typical example for a multi-scale simulation (Sec. 3). In
this simulation, we used a straightforward and autonomous resource selection
scheme, where the optimal site is chosen from a predefined list of available
resources. The simulation does not contain specific mechanisms to ensure fault
tolerance or fault recovery.

2 Living Application 

2.1 Rationale

A flexible approach is needed to execute a complex grid application with 
multiple tasks and a diverse palette of resource requirements. The application 
should then be able to switch between tasks at run-time and between the 
resources required for each of these tasks, while maintaining the integrity of 
its data during these switches. 

A switch requires the application to terminate its current execution, output its 
current state, and from that reinitialize the application using a new resource 
topology suited for the task at hand. Previously this has been done on a grid 
only in orchestration with a workflow manager. A job submitted by a workflow 
manager lacks the ability to change its resource topology during execution, as
it does not have the privileges to make use of grid schedulers. When running
an application with multiple tasks, this results in a 'bouncing' pattern where 
the manager submits jobs which return once a switch is required, only to be 
instantly submitted again to handle a different task. In the most favorable 
case, the performance loss introduced by bouncing and managerial overhead 
can be limited, but even then the successful completion of the simulation 
depends on the availability of an external manager, which is a potential single 
point of failure. 

2.2 How the living application works 

The living application switches between sites and tasks dynamically and with- 
out external dependencies. It is based on four principles: 

(1) It makes decisions on which tasks to do and which resources to use. 

(2) It makes these decisions based on knowledge it has acquired at run-time. 

(3) It changes resources and switches between tasks. 

(4) It operates autonomously. 

As a living application operates autonomously on the grid, it obtains its priv- 
ileges on its own without interacting with an external workflow manager or 
user. 

Upon initialization, the application is locally equipped with the tools and 
data to perform the required tasks and the criteria for switching between 
tasks or resource topologies. It is then submitted as a job to the grid with the 
initial resource requirements defined by the launcher. The living application 
begins execution on the grid and continues to do so until either a switch or a
termination is required.

The conditions for switching or termination are determined prior to the start 
of the calculation or during run-time, but they are not necessarily static. They 
can rely on the internal state of the application, or on information from exter- 
nal sensors. When the conditions for a switch have been met, the application 
will migrate to different grid resources, switch to a different task, or both. 

The switching between tasks requires two steps, which are finalizing the old 
task (and any program it still uses) and starting up the new task. During 
this switch, the application-specific data should be left intact. The switching 
between sites requires a larger number of actions, which are: 

(1) Creating a set of files consisting of the current application, files with 
its parameters and data and a script that specifies the methods and 
conditions for switching and termination. 

(2) Creating a job definition for the application on the new resources. 

(3) Authenticating (independently) on the grid. 

(4) Transferring the files to the remote site (if this is not done automatically 
by a resource broker). 

(5) Submitting the job, either through a resource broker or by directly ac- 
cessing the head nodes of grid sites. 

(6) Reinitializing the living application on the new site. 
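
The six steps above can be read as an ordered plan. The following Python sketch is only an illustration of that ordering: the function name `build_migration_plan`, the step labels and the `broker_transfers_files` flag are hypothetical and are not part of MUSE or of any grid middleware. The function lists the steps and does not contact the grid.

```python
def build_migration_plan(target_site, broker_transfers_files=False):
    """Return the ordered steps for a switch to target_site.

    Hypothetical helper: each entry is (label, description) and
    mirrors steps (1)-(6) of the text.
    """
    plan = [
        ("package", "bundle application, parameters, data and switch script"),
        ("define_job", "write a job definition for " + target_site),
        ("authenticate", "obtain grid privileges without user interaction"),
    ]
    # Step (4) is skipped when a resource broker stages the files itself.
    if not broker_transfers_files:
        plan.append(("transfer", "copy the file set to " + target_site))
    plan.append(("submit", "submit the job via a broker or a head node"))
    plan.append(("reinitialize", "restart the application on " + target_site))
    return plan


if __name__ == "__main__":
    for number, (label, text) in enumerate(build_migration_plan("zonker"), 1):
        print(number, label, "-", text)
```

Keeping the plan as plain data makes the optional nature of step (4) explicit; in a real job script each label would be bound to the corresponding grid client call.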

Additional file transfer may be required, if the application has locally written 
data that is required elsewhere. The application could initiate the transfer of 
output files either during run-time (e.g. if separate files are written) or just 
before a job terminates on one machine (if data is appended to a single large 
file or data transfer would cause overhead at run-time). 



The living application requires some user privileges to initiate data transfers 
and to autonomously migrate from one site to another. We obtain these privi- 
leges by using a grid client interface to access a credential management service. 
The details of this method are discussed in Sec. 2.2.1. The application requires
access to the grid client interfaces on all participating nodes to request these 
privileges during execution. Once these privileges are granted, the application 
can perform authentication, data transfers and job submissions to the grid. 



2.2.1 Security Considerations 



User privileges on the grid are provided by an X.509 grid proxy (Welch et al.,
2004), which requires the presence of a certificate, a private key and a correct
pass phrase typed in by the user. This proxy is represented by a temporary 
file with limited lifetime. The easiest way to provide user privileges to a living 
application would be to equip it with this file, transporting it as it migrates, 
allowing it to reuse the proxy on remote locations. However, this approach has 
three drawbacks: 

First, the presence of a proxy file on a remote site poses a security risk. If 
the file is not read-protected or is stored in a shared account, it may be possible
for other grid users to copy the proxy. The possession of this proxy enables 
them to impersonate the living application user for the duration of the proxy's 
lifetime, providing them with rights and resources that they could otherwise 
not use. Even if the proxy is on a dedicated account and read-protected, local 
users with admin rights are able to copy it and use it for impersonation. 



Second, it is not possible to cancel the application after the first stage, as the
proxy is initialized only at startup, after which it travels around on remote
sites. This may cause a malfunctioning application to continue running and
migrating until the proxy lifetime is exceeded. An application that is equipped
for self-reproduction may iteratively spawn multiple successors, which could
lead to a grid meltdown.



Third, for the same reasons as before it is also not possible to prolong the life- 
time of the proxy. This could cause the application to terminate prematurely 
once the proxy lifetime is exceeded. Specifying an excessively long lifetime 
relieves this problem, at the expense of increasing exposure to the other two 
drawbacks. 



To reduce these drawbacks we have chosen to use an intermediary MyProxy
server (Basney et al., 2005) in our implementation. The user initializes his or
her proxy on the MyProxy server, where it is encrypted using a unique password.
This password is stored in the living application, which uses it to obtain
short-lived user privileges from the MyProxy server. If the password is stolen,
others may be able to get these short-lived privileges, but the user can remove
access to these privileges at any time by destroying the credential.



During application execution, the user can also extend the lifetime of his 
MyProxy credential by renewing it. It is also possible to replicate the creden- 
tials to other MyProxy servers, which allows the application to use remote 
MyProxy servers if the local server has died, rather than terminating itself 
upon switching. 
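
As an illustration of how the application might request short-lived privileges, the sketch below assembles a call to the `myproxy-logon` command-line client. It is an assumption that the deployment uses this client; the function name, host name, user name and lifetime are placeholders. The function only builds the argument list, so nothing is contacted when it runs.

```python
def myproxy_logon_command(server, username, lifetime_hours=2, out_file=None):
    """Build the argument list for a myproxy-logon call (not executed here)."""
    if lifetime_hours <= 0:
        raise ValueError("proxy lifetime must be positive")
    command = [
        "myproxy-logon",
        "-s", server,                # MyProxy server host
        "-l", username,              # account that holds the stored credential
        "-t", str(lifetime_hours),   # requested proxy lifetime in hours
        "-S",                        # read the pass phrase from standard input
    ]
    if out_file is not None:
        command += ["-o", out_file]  # file in which to store the proxy
    return command


if __name__ == "__main__":
    # Placeholder host and user; a job script could hand this list to
    # subprocess.run(..., input=passphrase, text=True, check=True).
    print(" ".join(myproxy_logon_command("myproxy.example.org", "alice")))
```

Requesting a short lifetime keeps the exposure small if a proxy file is copied on a remote site, at the cost of more frequent requests to the MyProxy server.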



2.3 Living Simulation 



A special case of the living application is the living simulation. Today,
simulations of complex systems, in which the dynamic range exceeds the
standard precision of the computer, call for a wide range of numerical solvers
(Hoekstra et al., 2008). Each of these solvers may run most efficiently on a
different computer architecture. Most such simulations, however, are run on a
single computer even though they would benefit from running on a variety of
architectures.



This can be solved by migrating the application at run-time from one computer 
to another, in other words, by creating a living simulation. Such a simulation 
loads the solvers as a library module and is able to probe the internal variables 
of these solvers, making migration decisions based on this information. We 
demonstrate the concept of the living application by applying it to the (living) 
simulation of two galaxies merging. 



The term living simulation has been previously defined as simulations that
fine-tune their behavior at run-time based on input from external sensors,
e.g. to provide input for performing adaptive load balancing (Korkhov et al.,
2008). In our definition we provide the simulation with user privileges and
expect it to function autonomously.



3 Simulating galaxy mergers as a living simulation 

3.1 Motivation

A living simulation is based on the principle that it autonomously switches 
between sites and solvers whenever required. This switching is done dynam- 
ically and without external dependencies. The simulation is locally equipped 
with the required solvers, the switching criteria and the initial conditions. It 
is then submitted as a job to the grid with the initial resource requirements 
defined by the launcher. The living simulation begins calculating on the grid 
and continues to do so until either a switching condition or a termination 
condition has been met. 

By using the idea of the living applications, we have implemented and tested 
a living simulation, in which the merger of two galaxies, each with a central 
supermassive black hole (SMBH), is simulated. This is a computationally ex- 
pensive problem which requires integration with high accuracy during close 
encounters and in the final stages of merging, i.e. whenever the two SMBHs 
come close to each other. At an early phase and at large separation of the 
two galaxies, however, less accurate and therefore faster integration methods 
are sufficient. We improve the performance and the dynamic range of the tree 
code simulations (which are typically the method of choice for galaxy merger 
simulations) by hybridizing the tree code with a direct N-body solver.

In the scenario we are modelling, the two galaxies are initially well separated
by hundreds of kiloparsecs, but they approach each other on a bound orbit.
Dynamical processes lead to a redistribution of energy and momentum which
causes, among other things, the formation of tidal tails (see Fig. 1). Eventually,
these dynamical processes lead to the merger of the two galaxies.

Fig. 1. Simulation snapshot of a 260k particle simulation, where the two galaxies
approach for an initial interaction.

In this merger, the two SMBHs, which reside in the galaxy cores, will be
brought close together until they form a binary SMBH. Modelling the details
of the formation of a binary SMBH and its subsequent evolution requires a
very accurate integration. Therefore, we choose to switch from the tree code to
a direct N-body solver at a prespecified separation $r_a$ between the two SMBHs
(see also Portegies Zwart et al. (2008)). The switching allows us to follow the
full galaxy merger. This would not be possible using a single solver due to the
limited accuracy of the tree code and the computational costs of the direct
method.



In our living simulation, we make use of a dedicated GRAPE (GRAvity PipE,
Sugimoto et al. (1990)) special purpose computer to perform direct-method
integration, and a graphics processing unit (GPU) to perform tree simulations.

The living simulation initially integrates using a tree code on a GPU node,
but switches to a direct integrator on a GRAPE node when the separation
between the two SMBHs $r_{\mathrm{SMBH}} < r_a$. The simulation switches back
to the GPU once $r_{\mathrm{SMBH}} > r_a$.
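
The switching rule described above reduces to a single comparison. The sketch below is a minimal Python illustration; the function names `smbh_separation` and `select_solver` are hypothetical, and the treatment of the boundary case (separation exactly equal to the threshold, which the text leaves open) as a tree-code stage is an assumption.

```python
import math


def smbh_separation(position_a, position_b):
    """Euclidean distance between the two SMBH positions (3-tuples)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(position_a, position_b)))


def select_solver(r_smbh, r_a):
    """Return (solver, hardware) for the current SMBH separation.

    Direct integration on the GRAPE while r_smbh < r_a, tree code on
    the GPU otherwise (assumption: equality counts as 'tree').
    """
    if r_smbh < 0.0 or r_a < 0.0:
        raise ValueError("separations must be non-negative")
    if r_smbh < r_a:
        return ("direct", "GRAPE")
    return ("tree", "GPU")


if __name__ == "__main__":
    r = smbh_separation((0.0, 0.0, 0.0), (0.3, 0.4, 0.0))
    print(r, select_solver(r, r_a=1.0))
```

With this rule a threshold of zero never selects the direct solver and an infinite threshold always does, which corresponds to the pure tree and pure direct reference runs.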



3.2 Implementation 



We have used the Multiscale Software Environment (MUSE, footnote 1)
(Portegies Zwart et al., 2008) package to conduct our simulations. MUSE is a
multi-scale/multi-physics astrophysical framework that connects a variety of
astrophysical codes, enabling users to create combined simulations using Python
scripts. The interfacing between existing solvers is realized using SWIG
(Beazley, 1996) with a uniformly defined interface for each domain. By writing
scheduling scripts, users are able to access the different interfaces and create
simulations that use multiple solvers for a wide range of astrophysical problems.

The modular approach of MUSE lends itself very well to the grid architecture.
Modules run independently of each other and communicate through the
scheduling script. A grid-enabled scheduler would then send each module to
a different, suitable machine on the grid. Furthermore, many astrophysical
solvers run most efficiently on dedicated and specialized computers. GRAPE
boards, for example, have been used extensively and very successfully in the
field of stellar dynamics (e.g. Berczik et al., 2006; Baumgardt et al., 2003;
Gualandris and Merritt, 2008; Portegies Zwart and McMillan, 2002). In many
cases, a MUSE application requires one or more specialized platforms to run
on and is therefore best run on a grid of such specialized computers.



1 see http://muse.li 



In previous work (Portegies Zwart et al., 2008) we have extended MUSE with
a grid interface, allowing users to transfer files and perform simulations on
remote grid sites using a static and centralized scheduler which runs on the
local user machine. The grid interface has currently been implemented using
the PyGlobus API (Jackson, 2002), and an alternative DRMAA-compliant
interface is under development.

Our test implementation consists of two components, a launcher to initial- 
ize the living simulation and a job script that travels over the grid during 
simulation. The launcher: 

• Loads MUSE and the required modules, 

• reads the simulation input, 

• stores the parameters for each solver and the initial data for the first simu- 
lation stage, 

• transfers these files to the remote site, and 

• submits the job script as a grid job to the remote site. 

The living simulation grid job executes the Python job script, which: 

• Initializes the simulation that will be used, 

• reads and writes solver parameters and snapshots, 

• uses MUSE and SWIG to execute a simulation, 

• transfers files, and 

• submits a job script that computes the next simulation stage. 



The job script is able to periodically check internal variables of the local solver
at run-time using MUSE and SWIG. Consequently, the script is sensitive to
changes in these variables, and autonomously performs actions (e.g. migration
to a different site or file transfers) if certain conditions are met.
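
The periodic check described above can be pictured as a polling loop around the solver. The Python sketch below is not the MUSE job script: `ToySolver` is a hypothetical stand-in whose SMBH separation simply shrinks linearly in time, and `run_stage`, `check_interval` and the return values are illustrative names.

```python
class ToySolver:
    """Stand-in for a solver module; the separation shrinks linearly."""

    def __init__(self, separation, shrink_rate):
        self.time = 0.0
        self._separation = separation
        self._shrink_rate = shrink_rate

    def evolve(self, dt):
        # Advance the model time and update the probed variable.
        self.time += dt
        self._separation = max(0.0, self._separation - self._shrink_rate * dt)

    def smbh_separation(self):
        return self._separation


def run_stage(solver, r_a, t_end, check_interval):
    """Advance the solver, probing it after every check_interval.

    Returns ("switch", t) once the separation drops below r_a, or
    ("finished", t) if t_end is reached first.
    """
    while solver.time < t_end:
        solver.evolve(min(check_interval, t_end - solver.time))
        if solver.smbh_separation() < r_a:
            return ("switch", solver.time)
    return ("finished", solver.time)


if __name__ == "__main__":
    print(run_stage(ToySolver(2.0, 0.5), r_a=1.0, t_end=20.0, check_interval=0.25))
```

On a "switch" result a job script would save a snapshot, transfer it and submit the next stage; the check interval bounds how far past the threshold a stage can run before the switch is noticed.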



3.3 Experiment setup 



For our experiments we make use of two grid nodes, one node equipped with
a GRAPE-6Af (Fukushige et al., 2005) at Drexel University in Philadelphia,
United States and one node with an Nvidia 8800 Ultra GPU at the University
of Amsterdam in the Netherlands. The GRAPE-6Af has a peak performance
of approximately 123 Gflops and an effective performance of up to ~85 Gflops
when performing a direct-method simulation (Fukushige et al., 2003). The
Nvidia 8800 Ultra has a theoretical peak performance of about 384 Gflops
and a sustained performance of up to ~100 Gflops when performing an N-body
tree simulation using octgrav (E. Gaburov, personal communication). The
specification of the nodes can be found in Table 1. On both nodes we have
installed Globus 4.0.6 grid middleware (Foster, 200?) with GRAM, GridFTP
and a MyProxy client, as well as the MUSE framework. The nodes are linked
using a regular internet connection for which we have measured a latency of
100 ms and a bandwidth of approximately 550 kB/s.

On these nodes we run galaxy collision simulations (using simplified galaxy
models, see below) that each last for 20 N-body time units (Heggie and Mathieu,
1986). In all our runs, this duration was sufficient to perform a full collision
between the two galaxies.



The initial conditions for the galaxy collision consist of two equally-sized
Plummer sphere particle distributions (Plummer, 1911), each of which has a
central SMBH. We perform simulations with N = 2k to 64k particles (footnote 2).
The total mass of particles in each galaxy is $M = 1$ and the mass of individual
particles is $m = M/N$. The SMBHs each have a mass of $m_{\mathrm{BH}} = 0.01$,
or 1% of the stellar mass of the galaxy.
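
To make the mass bookkeeping above concrete, the following Python sketch draws star positions for one galaxy from a Plummer density profile and places an SMBH at the centre. It is an illustration and not the generator used for the runs: the function name `plummer_galaxy` is hypothetical, the scale length a = 3π/16 assumes each galaxy is individually scaled to standard N-body units, and velocities are omitted.

```python
import math
import random


def plummer_galaxy(n_stars, total_mass=1.0, bh_mass=0.01, seed=0):
    """Return [(mass, (x, y, z)), ...] with the SMBH as first entry."""
    rng = random.Random(seed)
    a = 3.0 * math.pi / 16.0            # assumed Plummer scale length
    star_mass = total_mass / n_stars    # m = M / N
    particles = [(bh_mass, (0.0, 0.0, 0.0))]
    for _ in range(n_stars):
        # Invert the enclosed-mass fraction X = r^3 / (r^2 + a^2)^(3/2).
        while True:
            x = rng.random()
            if x > 0.0:
                denom = x ** (-2.0 / 3.0) - 1.0
                if denom > 0.0:
                    break
        r = a / math.sqrt(denom)
        # Isotropic direction: uniform in cos(theta) and in phi.
        cos_theta = rng.uniform(-1.0, 1.0)
        sin_theta = math.sqrt(1.0 - cos_theta * cos_theta)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        position = (r * sin_theta * math.cos(phi),
                    r * sin_theta * math.sin(phi),
                    r * cos_theta)
        particles.append((star_mass, position))
    return particles


if __name__ == "__main__":
    galaxy = plummer_galaxy(1024)
    print(len(galaxy), sum(mass for mass, _ in galaxy[1:]))
```

A second galaxy would be generated the same way and offset in position and velocity to set up the encounter orbit, which this sketch does not attempt.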



When the two galaxies are far apart we use the tree code (Barnes and Hut,
1986), in which more distant particles are grouped together to enable a
hierarchical reduction in the force computation. The equations of motion are
solved using the 2nd order leap-frog particle integration scheme (Hockney and
Eastwood, 1988) with a fixed time step. The octgrav tree code we use is written
to run on a graphical processing unit (Gaburov et al., 2009, in preparation).
The opening angle for the tree code is $\theta = 0.7$ and we use a time step of
1/64 N-body time unit (1/128 for the largest data set). The direct-method
integration is performed using phiGRAPE (Harfst et al., 2007). In phiGRAPE,
particles have individual (block) time steps and the time step parameter $\eta$
was set to 0.02 (Makino and Aarseth, 1992). We also defined a maximum time
step of $2^{-5}$ and a minimum time step of $2^{-23}$ N-body time units. A
softening of $\epsilon = 0.01$ is used in both integration methods.
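
The step limits quoted above can be illustrated with a small Python sketch of block time step quantization, in which a desired step is rounded down to a power of two and clipped to the allowed range. This is a generic illustration, not phiGRAPE's code: the function name `block_time_step` is hypothetical, and the criterion that produces the desired step from $\eta$ is not reproduced here.

```python
import math

DT_MAX = 2.0 ** -5    # maximum time step from the text
DT_MIN = 2.0 ** -23   # minimum time step from the text


def block_time_step(dt_wanted, dt_min=DT_MIN, dt_max=DT_MAX):
    """Round dt_wanted down to a power of two within [dt_min, dt_max]."""
    if dt_wanted <= 0.0:
        raise ValueError("time step must be positive")
    clipped = min(max(dt_wanted, dt_min), dt_max)
    exponent = math.floor(math.log2(clipped))
    return 2.0 ** exponent


if __name__ == "__main__":
    for wanted in (1.0, 0.01, 1e-9):
        print(wanted, "->", block_time_step(wanted))
```

Restricting steps to powers of two lets particles with the same step be advanced together as a block. For comparison, the fixed tree-code step of 1/64 corresponds to 20 x 64 = 1280 steps over the 20 N-body time units of a run.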



We have performed two profiling experiments, using direct integration whenever
the separation of the central black holes was less than $r_a$, and tree at all
other times. The first experiment varies the number of simulation particles,
while maintaining $r_a = \sqrt{0.3}$. The other experiment uses 32k particles and a
different $r_a$ for each run. For comparison, we have also included a full tree and
a full direct run.



2 i.e. 1024 to 32768 particles per galaxy as well as 2 SMBH particles. 



3.4 Results 



We have summarized the results of our living simulation in two figures. The
absolute time spent on each task as a function of the number of particles is
given in Fig. 2 and the relative time share of each task is shown in Fig. 3.
For all the tested initial conditions, the simulation migrated itself three times,
resulting in four initializations and three simulation migrations per run.

In this experiment, we find that the direct N-body integration dominates the 
simulation performance in all cases, and that for larger N, the relative over- 
head caused by grid data transfers and job submissions diminishes. Although 
the time spent on local I/O scales steeply due to unoptimized identifier lookup 
calls (this has recently been fixed in MUSE), this overhead remains relatively 
small throughout our runs. When using 64k particles, we found that ~4
percent of the simulation time is spent on overhead tasks.

We have performed several runs with 32k particles, using a different $r_a$ for
each run. The results of this experiment are shown in Table 2. During the runs
we observed several close interactions between the SMBHs, and a decreasing
trend in the value of $r_{\mathrm{SMBH}}$. This behavior caused the living
simulation runs with smaller $r_a$ to switch more frequently.

A pure tree integration ($r_a = 0$) leads to the highest cumulative energy
error, whereas a pure direct integration ($r_a = \infty$) has the lowest error.
When switching between both codes with the living simulation, the energy error
is lower than using pure tree, but much higher than using a direct code. Even
when using $r_a = \sqrt{10}$, where the code switches only once after 4 N-body
time units, we see a much larger error than when using only direct. The energy
error is dominated by the execution of the tree code. This difference is caused
by the tree-based force calculation as well as by the second-order leapfrog
integration scheme used in the tree code. A detailed discussion on the energy
behavior of these combined simulations can be found in Harfst et al. (2009,
in preparation).

Fig. 2. Timing measurements of the living simulation tasks as a function of the
number of simulated particles. The two solid lines represent time spent on direct
integration (bullets) and tree integration (circles). The thick dashed lines indicate
grid overhead by data transfers (open squares) and job submissions (filled squares).
Finally, the two thin dashed lines indicate overhead caused by local file I/O (open
triangles) and code initializations (filled triangles).

The simulation performance is dominated by N-body integration in all cases,
although there is a relatively high overhead for $r_a = 0.1$, which is caused
by the 29 switches. Each of these switches requires the particles to be saved
locally, sent across the Atlantic using regular internet, and loaded on the new
machine.
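
The overhead share can be read directly from the timings in Table 2. The short Python sketch below redoes that arithmetic for the five hybrid runs, identified by their number of switches; the variable and function names are illustrative, and the "other" column is taken as the total overhead, so the per-switch figure also contains the initial start-up cost.

```python
# (switches, direct [s], tree [s], other [s], total [s]), from Table 2
RUNS = [
    (29, 762, 219, 944, 1925),
    (7, 1820, 160, 257, 2237),
    (3, 2180, 143, 120, 2443),
    (3, 2519, 127, 118, 2764),
    (1, 3624, 64, 54, 3742),
]


def overhead_summary(runs):
    """Return (switches, overhead fraction, overhead seconds per switch)."""
    summary = []
    for switches, direct, tree, other, total in runs:
        if direct + tree + other != total:
            raise ValueError("row does not add up to its total")
        summary.append((switches, other / total, other / switches))
    return summary


if __name__ == "__main__":
    for switches, fraction, per_switch in overhead_summary(RUNS):
        print(f"{switches:2d} switches: {100 * fraction:5.1f}% overhead, "
              f"{per_switch:5.1f} s per switch")
```

For the run with 29 switches this gives 944/1925, or about 49 percent of the wall-clock time, against 54/3742, or about 1.4 percent, for the run with a single switch.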







Fig. 3. Relative cumulative share of time spent by the living simulation tasks as a 
function of the number of simulated particles. From top to bottom the areas refer 
to the share of time spent on direct integration, tree integration, local file I/O, grid 
data transfer, grid job submissions and simulation initializations. Note that both 
axes are in log-scale. 

4 Conclusion 

We introduced the living application as a way to manage complex applica- 
tions on a large distributed infrastructure. Due to the autonomous nature of 
a living simulation, it is important to provide a mechanism that allows the 
user to terminate it. By having the simulation retrieve its extended privileges 
from a credential management service (MyProxy), users are able to revoke 
the privileges of the simulation regardless of its location. In addition, we can 
renew short-lived proxy credentials instead of using a long-lived credential, 
which may be attractive to malicious users. 

We then apply this concept in a living simulation of two galaxies merging, us- 
ing a straightforward and autonomous resource selection scheme which chooses 
from a predefined list of available resources. Our approach allows the
simulation to use the optimal compute resources for each of the two solvers,
switching resources whenever a different solver is required. In our example,
the solvers were a tree code and a direct N-body method, which were optimized
for two
kinds of special-purpose hardware, namely a GPU (tree) and a GRAPE (di- 
rect). The switches take place autonomously without user intervention, remote 
output retrieval or external managers. In our experiments, the execution time 
was only affected marginally by overhead such as caused by job migration and 
data transfer over the grid. In the cases where each solver is best run on a 
different architecture and the overall simulation performance is not dominated 
by switching overhead, we find that the living simulation is a practical and 
resource efficient solution. 

The creation of grid species enables us to give a simulation the ability to
autonomously use the grid, acquire and apply internal knowledge, and migrate
itself. In this work we presented a first implementation, which we intend
to extend in the near future. Possible extensions include connecting living
applications with grid resource monitoring and discovery services to dynamically
obtain information on resource availability, and developing a living application
which is able to recover from failures of grid nodes. These extensions would
allow the living application to evolve into a more complex organism, which
can be applied to problems of greater complexity.
Table 1

Specifications for the test nodes. The first column gives the name of the computer
followed by its country of residence (NL for the Netherlands, US for the United
States). The subsequent columns give the type of processor in the node, followed by
the amount of RAM, the operating system, and the special hardware installed on
the PC. Both nodes are connected to the internet with a 1 Gbit/s Ethernet card.

name      location  CPU type         RAM [MB]  OS      hardware
darkstar  NL        Core2Duo 3.0GHz  2048      Debian  Nvidia 8800 Ultra
zonker    US        2x Xeon 3.6GHz   2048      Gentoo  GRAPE 6A

Table 2

Timing and energy measurements of the living simulation tasks using 32k particles
with a different value $r_a$ during each run, given in the first column. The second
column gives the number of switches during the simulation, while the subsequent
columns respectively give the times spent on direct integration, tree integration and
overhead tasks. The total execution time and the total relative energy error are
respectively given in the last two columns.

$r_a$              # switches  direct [s]  tree [s]  other [s]  total [s]  dE/E
0.0 (tree)                                 247       24         271        1.47 x 10^-2
0.1                29          762         219       944        1925       5.93 x 10^-3
[...]              7           1820        160       257        2237       3.54 x 10^-3
[...]              3           2180        143       120        2443       2.88 x 10^-3
1.0                3           2519        127       118        2764       2.49 x 10^-3
$\sqrt{10}$        1           3624        64        54         3742       1.04 x 10^-3
$\infty$ (direct)              4528                  5          4533       2.77 x 10^-6